ABSTRACT

Recently, MapReduce based spatial query systems have emerged
as a cost effective and scalable solution to large scale spatial
data processing and analytics. MapReduce based systems
achieve massive scalability by partitioning the data and
running query tasks on those partitions in parallel. Therefore,
effective data partitioning is critical for task parallelization
and load balancing, and directly affects system performance.
However, several pitfalls of spatial data partitioning
make this task particularly challenging. First, data skew
is very common in spatial applications. To achieve the best
query performance, data skew needs to be reduced to a
minimum. Second, spatial partitioning approaches generate
boundary objects that cross multiple partitions and add
extra query processing overhead. Therefore, boundary objects
need to be minimized. Third, the high computational
complexity of spatial partitioning algorithms, combined with
massive amounts of data, requires an efficient approach to
partitioning to achieve overall fast query response. In this
paper, we provide a systematic evaluation of multiple spatial
partitioning methods with a set of different partitioning
strategies, and study their implications on the performance
of MapReduce based spatial queries. We also study
sampling based partitioning methods and their impact on
queries, and propose several MapReduce based high performance
spatial partitioning methods. The main objective of
our work is to provide comprehensive guidance for optimal
spatial data partitioning to support scalable and fast spatial
data processing in distributed computing environments such
as MapReduce. The algorithms developed in this work are
open source and can be easily integrated into different high
performance spatial data processing systems.

1. INTRODUCTION 

The proliferation of ubiquitous positioning technology and
mobile devices, together with the rapid improvement of high
resolution data acquisition technologies, has enabled us to collect
massive amounts of spatial data in a way that was never before
possible. The volume and velocity of data will only increase
as we shift towards the Internet of Things paradigm,
in which devices have spatial awareness and produce data
while interacting with each other. As science and business
become increasingly data-driven, timely analysis and
management of such data is of utmost importance to data
owners. A wide spectrum of applications and scientific disciplines,
such as GIS, Location Based Social Networks (LBSN),
neuroscience [4], medical imaging [38] and astronomy [27],
can benefit from an efficient spatial query system to cope
with the challenges of Spatial Big Data.

To effectively store, manage and process such large amounts
of spatial data, a scalable distributed data management system
is essential. Recently, the MapReduce framework [9]
has become the de facto standard for handling large scale
data processing tasks, and it has many salient features such
as massive scalability, fault-tolerance, easy programmability
and low deployment cost. With the success of MapReduce, a
number of spatial query systems [5, 23, 30] and frameworks
[6, 12] have emerged to enable large scale spatial query
processing on MapReduce and cloud platforms.

Data partitioning is a powerful mechanism for improving
the efficiency of data management systems, and it is a standard
feature in modern database systems. In fact, state-of-the-art
systems employ a shared-nothing architecture [36], and
both MapReduce and parallel DBMSs are examples of such an
architecture. Aside from the fact that data partitioning
improves the overall manageability of large datasets, it
improves query performance in two ways. First, partitioning
the data into smaller units enables a query to be processed in
parallel, which improves throughput. Second,
with a proper partitioning schema, I/O can be significantly
reduced by scanning only the few partitions that contain data
relevant to the query. Therefore, a partitioning
approach that evenly distributes the data across nodes and
facilitates parallel processing is essential for achieving fast
query response and optimal system performance.

Spatial data partitioning, however, is particularly challenging
due to several pitfalls that are endemic to spatial
data and query processing.

Spatial Data Skew. Data skew is very common and severe
in spatial applications. For example, in a microscopic
pathology imaging scenario, tumorous tissues contain far
more spatial objects (segmented cells), whereas cells are
more evenly distributed in healthy tissues. In geospatial
applications (e.g., OpenStreetMap), some countries and regions
have more detailed mapping information thanks to enthusiastic
data contributors. For example, if OpenStreetMap
is partitioned into 1000 x 1000 fixed size tiles, the number
of objects contained in the most skewed tile is nearly three
orders of magnitude larger than that in an average tile.
Data skew is detrimental to query performance [35]
and curtails system scalability [29]. Therefore,
to achieve the best query performance, a spatial partitioning
approach should avoid a skewed partitioning whenever
possible.

Boundary Objects. Spatial partitioning approaches generate
boundary objects that cross multiple partitions, thus
violating partition independence. As spatial objects
have complex boundaries and extents, imposing a rectangular
region based partitioning on a sufficiently large dataset
will almost certainly produce objects that cross multiple
partition boundaries. Spatial query processing algorithms get
around the boundary problem by using a replicate-and-filter
approach [29, 39], in which boundary objects are replicated
to multiple spatial partitions and the side effects of such
replication are remedied by filtering out the duplicates at the end of
the query processing phase. This process adds extra query
processing overhead, which grows with the volume
of boundary objects. Therefore, a good spatial partitioning
approach should aim to minimize the number of boundary
objects.

Performance. Spatial partitioning algorithms are expensive
to compute compared to conventional one dimensional
table partitioning algorithms, such as hash and range
partitioning, which can be done quickly on the fly. The
multidimensional nature of spatial data entails that most spatial
operators are of linear time complexity. The high computational
complexity, combined with massive amounts of
data, requires an efficient approach to spatial partitioning to
achieve overall fast query response. This is particularly
important for spatial-temporal data, where new spatial data
has to be partitioned and processed in a timely fashion.

To the best of our knowledge, no spatial database system
provides a graceful approach to spatial partitioning. Previously,
Paradise [29] - a parallel spatial database system -
used a regular fixed grid partitioning for parallel join
processing. Fixed grid partitioning is the basis of many spatial
algorithms and it is easy to compute. However, as mentioned
in the original work, the fixed grid approach suffers from
both the data skew problem and the boundary object problem.

In the relevant research literature, some of these challenges
have received attention in various contexts. However,
in most cases the problem is either not fully explored or
circumvented with domain specific ad-hoc fixes. To
fill those gaps, a systematic and detailed study of spatial
partitioning approaches for parallel spatial query processing is
needed. In this paper, we provide a detailed study of a
set of six spatial partitioning approaches within a MapReduce
based spatial query processing framework [6]. The
approaches described here can be reused in any other
distributed or parallel spatial query processing system beyond
MapReduce that can take advantage of data partitioning.
We also provide parallelization strategies that improve the
efficiency of the partitioning algorithms, which in turn benefits
tasks such as data loading and ad-hoc query
processing. In summary, our main contributions are as follows:

1. To the best of our knowledge, this is the first formal attempt
to introduce and address the spatial data partitioning
problem for parallel query processing.

2. We present six spatial partitioning algorithms in detail,
and provide a general classification of those approaches
along three dimensions.

3. We systematically study various properties of the presented
spatial partitioning algorithms and their effects on
query performance, and provide a comprehensive empirical
evaluation on two large scale real-world datasets.

4. We propose MapReduce based algorithms for parallel
spatial partitioning, and evaluate their performance in detail.

2. BACKGROUND 

2.1 Spatial Query Processing with MapReduce 

Recently, several MapReduce based spatial query systems
[5, 12] have emerged to support scalable spatial query
processing on large datasets. While these systems may vary
in implementation details and at the query language layer,
conceptually they are very similar. Algorithm 1 sketches
Hadoop-GIS - a general MapReduce based spatial query
processing framework [6]. As the algorithm shows, data is
spatially partitioned and staged to HDFS; spatial queries
are expressed as a set of operators that are translated to
MapReduce tasks at runtime. Tasks run on the partitioned
input for parallel query processing. Queries are
implicitly parallelized through MapReduce, and a tile is the
parallelization unit that a Mapper/Reducer can
process independently. As spatial partitioning closely resembles
tiling of two dimensional space, we use tile and spatial
partition interchangeably hereafter.


Algorithm 1: MapReduce based spatial query processing framework

A. Data/space partitioning;
B. Staging of partitioned data to HDFS;
C. Pre-query processing (optional);
D. for tile in input do
       Index building for objects in the tile;
       Tile based spatial query processing;
E. Boundary object handling;
F. Post-query processing (optional);


Fig. 1 shows a simple example where the dataset is partitioned
into four tiles (dotted lines depict partition boundaries).
To process a spatial join query such as "find object pairs
that intersect with each other from two datasets", a single
MapReduce job can be started where each tile is processed
by a single mapper ($T_1$, $T_2$, $T_3$, $T_4$). In this example, tile 3
contains more objects than the other tiles, and consequently
requires more processing time. As a result, the corresponding
MapReduce task ($T_3$) becomes a straggler task - a performance
bottleneck in MapReduce based queries.

Given a spatial dataset $R$, ideally, we would like to derive
a spatial decomposition $R = \bigcup_{i=1}^{k} R_i$ where $R_i \cap R_j = \emptyset$
for $i \neq j$, such that query tasks can be performed on each
partition $R_i$ independently in parallel.

2.2 Boundary Objects 
Figure 1: An example of MapReduce based spatial
query parallelization


There is one specific problem that is endemic to spatial
partitioning - boundary objects. Due to their multidimensional
nature, some spatial objects may span more than one
partition. For example, in Fig. 1 the big round object in the
middle crosses all tile boundaries. As a result, $R_i \cap R_j \neq \emptyset$
for some $i \neq j$, and the spatial query processing framework
must be able to handle such cases.

Parallel spatial query processing algorithms remedy the
boundary object problem in two ways, namely multi-assignment,
single-join (MASJ) and single-assignment, multi-join (SAMJ)
[39, 22]. In the MASJ approach, each boundary object is replicated
to each tile that overlaps with the object. During
the query processing phase, each partition is processed only
once without special treatment of the boundary objects. Then a
de-duplication step is initiated to remove the redundancies that
result from the replication. In the SAMJ approach, by contrast,
each boundary object is assigned to only one tile. Therefore,
during the query processing phase, each tile is processed
multiple times to account for the boundary objects.

Both approaches introduce extra query processing overhead.
In MASJ, the replication of boundary objects incurs
extra storage cost and computation cost. In SAMJ, however,
only extra computation cost is incurred by processing
the same partition multiple times. Hadoop-GIS takes the
MASJ approach [6], and the original work pointed out that:
(a) in practice, the MASJ approach has been shown to be
significantly more efficient than the SAMJ approach [39]; (b) the MASJ
approach allows a higher degree of parallelization such that,
for large datasets, the query processing efficiency can be
greatly improved and the de-duplication cost can be
diminishingly small.
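The replicate-then-deduplicate flow of MASJ can be illustrated with a minimal single-process sketch. It assumes objects and tiles are axis-aligned rectangles given as (xmin, ymin, xmax, ymax) tuples; the helper names `intersects` and `masj_join` are hypothetical and are not part of Hadoop-GIS.

```python
from itertools import product

def intersects(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def masj_join(R, S, tiles):
    """Hypothetical helper: multi-assignment, single-join over a list of tile rectangles."""
    pairs = set()  # collecting into a set plays the role of the de-duplication step
    for tile in tiles:
        # Multi-assignment: an object is replicated to every tile it overlaps.
        r_ids = [i for i, r in enumerate(R) if intersects(r, tile)]
        s_ids = [j for j, s in enumerate(S) if intersects(s, tile)]
        # Single-join: each tile is joined exactly once, independently of the others.
        for i, j in product(r_ids, s_ids):
            if intersects(R[i], S[j]):
                pairs.add((i, j))
    return sorted(pairs)

tiles = [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 2, 4), (2, 2, 4, 4)]
R = [(1, 1, 3, 3)]                                  # crosses all four tiles
S = [(1.5, 1.5, 2.5, 2.5), (3.5, 3.5, 3.9, 3.9)]
result = masj_join(R, S, tiles)
```

In this toy input the intersecting pair is discovered once in each of the four tiles, and the set collapses the four copies into a single result, which is the redundancy the de-duplication step has to pay for.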

2.3 Query Processing Cost Model 

In the Hadoop-GIS framework, the cost of processing a query
includes the processing cost of non-boundary objects, the
processing cost of duplicated boundary objects, and the object
de-duplication cost. A modeling approach can help us better understand
the overall query processing overhead and provide principled
guidelines for tuning spatial partitioning for the best
query performance.

Given two datasets $R$ and $S$ (round objects and polygonal
objects in Fig. 1), a spatial join query finds all pairs
$r_i \in R$, $s_j \in S$ that satisfy a spatial topology relationship
$F(r_i, s_j) = 1$. The choice of the spatial topology function
can be arbitrary; without loss of generality, we use
st_intersects as an example throughout the paper. Let us
assume that the datasets are merged and co-partitioned with a
partition schema which results in partitions $R = \bigcup_{i=1}^{k} R_i$
and $S = \bigcup_{i=1}^{k} S_i$. Following the MapReduce based query
processing framework, we have:

$$R \bowtie S = \bigcup_{i=1}^{k} R_i \bowtie S_i \qquad (1)$$

A few assumptions can help us simplify the analysis. (a) Each
dataset follows a uniform distribution. Consequently, ignoring
boundary objects, each partition contains roughly
$|R|/k$ and $|S|/k$ objects. (b) For each partition, the fraction
of boundary objects is $\alpha$. Hence, each partition contains
$(1 + \alpha)|R|/k$ objects of $R$ (and likewise for $S$) as a result of
boundary object replication. (c) The overall query processing cost
is the sum of the partitioned query processing cost $C_1$ and the data
deduplication cost $C_2$, which is an amount $\beta(|R| + |S|)$ that is
bounded by the dataset size [6]. After we plug the cost functions into
Equation (1), the overall query processing cost is:


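$$
\begin{aligned}
C(R \bowtie S) &= \sum_{i=1}^{k} C_1(R_i \bowtie S_i) + C_2 \\
&= \sum_{i=1}^{k} \frac{(1+\alpha)|R|}{k} \cdot \frac{(1+\alpha)|S|}{k} + \beta(|R| + |S|) \\
&= \frac{(1+\alpha)^2 |R||S|}{k} + \beta(|R| + |S|)
\end{aligned}
$$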


The cost model provides us an important insight - partition
granularity is a double-edged sword. On one hand,
a finer level of partitioning (larger $k$) improves the query
performance. On the other hand, a finer level of partitioning
generates a larger fraction of boundary objects (larger
$\alpha$), and consequently it is detrimental to the query
performance. Clearly, there is a sweet spot for the partition
granularity which yields the best query performance. Finding the
optimal partition granularity is non-trivial as it depends on
the dataset characteristics and query type, and we plan to
explore this problem separately in our future work.
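The two opposing effects can be made concrete with a small numeric sketch. It assumes that the per-partition cost $C_1$ is the product of the two replicated partition sizes, $(1+\alpha)|R|/k$ and $(1+\alpha)|S|/k$, and that the de-duplication cost is $\beta(|R|+|S|)$; the function name `join_cost` and all parameter values are illustrative assumptions, not measurements.

```python
def join_cost(n_r, n_s, k, alpha, beta):
    """Modelled join cost under the assumptions stated above (illustrative only).

    n_r, n_s : dataset sizes |R| and |S|
    k        : number of partitions
    alpha    : fraction of boundary objects per partition
    beta     : de-duplication cost factor
    """
    per_partition = ((1 + alpha) * n_r / k) * ((1 + alpha) * n_s / k)
    return k * per_partition + beta * (n_r + n_s)

# Holding alpha fixed, more partitions lower the modelled cost ...
coarse = join_cost(1000, 1000, k=10, alpha=0.1, beta=1.0)
fine = join_cost(1000, 1000, k=20, alpha=0.1, beta=1.0)

# ... while holding k fixed, more boundary objects raise it.
few_boundary = join_cost(1000, 1000, k=10, alpha=0.1, beta=1.0)
many_boundary = join_cost(1000, 1000, k=10, alpha=0.2, beta=1.0)
```

Because $\alpha$ itself tends to grow with $k$ in practice, the two effects pull against each other, which is where the sweet spot comes from; the sketch deliberately does not model how $\alpha$ depends on $k$.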


3. CLASSIFICATION OF SPATIAL PARTITION ALGORITHMS

In this paper we study six spatial partition algorithms that
are representative of different classes of approaches. Before
we delve into the technical details, we give a high-level view
to help understand how these
algorithms are related and what their major differences are.
Here, we categorize the algorithms along three
dimensions; Table 1 summarizes the classification. The
algorithmic details are discussed in the next section.

3.1 Partition Boundary 

We start with whether the spatial partition boundaries 
overlap with each other. 

Non-overlapping. Algorithms in this category generate
spatial partitions whose boundaries do not overlap with
each other. Non-overlapping partitioning is ideal for most
query processing tasks as it does not incur any extra storage
or computation overhead other than replicated boundary
objects. For this reason, in this paper we mostly
focus on this class of algorithms, which includes FG, BSP,
SLC, and BOS.


Dimension           Category          BSP   FG    SLC   BOS   STR   HC
Partition Boundary  overlapping       -     -     -     -     ✓     ✓
                    non-overlapping   ✓     ✓     ✓     ✓     -     -
Search Strategy     top-down          ✓     NA    -     -     -     -
                    bottom-up         -     NA    ✓     ✓     ✓     ✓
Split Criterion     space-oriented    ✓     ✓     -     -     -     -
                    data-oriented     -     -     ✓     ✓     ✓     ✓

Table 1: A general classification of spatial partition algorithms. BSP: Binary split partitioning, FG: Fixed grid
partitioning, SLC: Strip partitioning, BOS: Boundary optimized strip partitioning, STR: sort-tile-recursive
partitioning, HC: Hilbert curve partitioning.


Overlapping. Algorithms in this category relax the
non-overlapping boundary condition, and allow generated partitions
to overlap with each other. Most spatial index construction
algorithms [32] are based on a similar idea, and
packing algorithms such as STR [21] and Hilbert Curve
[18] belong to this class. Since the partitions may overlap
with each other, some fraction of objects would be present in
multiple partitions. Those multi-partition objects would be
replicated and assigned to each of the overlapping partitions.
As a result, in this class of approaches the replication factor
$\alpha$ can be high, which consequently increases the deduplication
cost factor $\beta$. However, if a good partitioning can be
quickly obtained by allowing the partitions to overlap, then
the extra cost can be compensated by the improved query
performance.

3.2 Search Strategy 

The second dimension we consider is the search strategy,
which focuses on how the partitions are generated.

Top-down. This class of algorithms generates partitions
in a top-down manner. Specifically, given a dataset and an
expected partition payload $b$ (the number of objects assigned
to a partition), a top-down approach recursively splits
the dataset into $k$ sub-partitions, and examines whether any
sub-partition has more than $b$ objects. If a sub-partition has
more than $b$ objects, it is partitioned further until
the payload requirement is met. Most spatial indices are
constructed using a similar procedure. While the value of the
parameter $k$ can be chosen arbitrarily, some specific values,
such as $k = 2$ (BSP) and $k = 4$ (Quad-Tree), are used
more frequently in practice. Depending on the split criterion,
this class of algorithms can be implemented as either
data-oriented or space-oriented, and we describe these categories
in the next subsection.

Bottom-up. Rather than generating partitions in a recursive
manner, this class of algorithms attempts to construct
the final partitions as early as possible. Such an approach bears
some resemblance to spatial packing algorithms. The
general idea is to use proximity information of spatial objects
to group them into partitions. Since there is no total ordering
of multi-dimensional objects that preserves spatial proximity,
Space Filling Curves are used to generate an approximate
one dimensional ordering. Then, objects are packed into
partitions by grouping them based on this ordering.

3.3 Partition Criterion 

Finally, splitting an oversized partition into smaller ones
is a core subroutine in spatial partitioning, and algorithms
may use different criteria for this task. For example, consider
a simple case where a partition with payload $w$ needs
to be partitioned into two sub-partitions. There are two
strategies: space oriented and data oriented.

Space Oriented. This class of algorithms generates sub-partitions
by spatially decomposing the current partition boundary
into two equal sub-spaces. As the split decision is made
solely based on the space, this approach suffers from data
skew. If the data distribution is uniform, we would expect
to get two sub-partitions where each of them has a payload
of roughly $w/2$. However, if the data distribution is skewed, it
is possible that one of the sub-partitions still contains a large
fraction of the objects in the original partition, while the other
contains only a few objects.

Data Oriented. This class of algorithms generates sub-partitions
by finding a cut such that each resulting sub-partition
contains a roughly equal amount of data ($w/2$). The
cut position is derived from the distribution of data objects
rather than by splitting the space. However, finding an
optimal cut which generates an even partitioning requires
significant computational effort. Furthermore, the algorithms
also need to be judicious about the split position so that
the number of boundary objects induced by the split is not
very large.
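The difference between the two criteria shows up clearly on skewed data. The following minimal sketch works on one dimensional object centers only and ignores boundary objects; the function names `space_split` and `data_split` are hypothetical.

```python
from statistics import median

def space_split(xs, lo, hi):
    """Space oriented: cut at the midpoint of the extent [lo, hi]."""
    cut = (lo + hi) / 2
    return [x for x in xs if x < cut], [x for x in xs if x >= cut]

def data_split(xs):
    """Data oriented: cut at the median so both sides hold about half the objects."""
    cut = median(xs)
    return [x for x in xs if x < cut], [x for x in xs if x >= cut]

# Nine objects clustered near the origin and one far away (skewed data).
centers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]

space_left, space_right = space_split(centers, lo=0, hi=100)
data_left, data_right = data_split(centers)
```

On this input the space oriented split leaves nine objects on one side and one on the other, while the median cut yields five and five. The median requires sorting the data, which hints at the extra computational effort mentioned above.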

4. SPATIAL PARTITION ALGORITHMS 

4.1 Preliminaries 

We study the following partition problem: given a set of
$d$-dimensional spatial objects $R = \{r_1, r_2, \ldots, r_i, \ldots, r_n\}$ ($|R| = n$),
a partition algorithm partitions $R$ into $k$ partitions
$P = \{p_1, p_2, \ldots, p_j, \ldots, p_k\}$, where each partition is size bounded,
$|p_j| \le b$, and the number of partitions $k$ is minimized. Without
loss of generality, we consider the case where $d = 2$
and a spatial object is approximated by its MBR (Minimum
Bounding Rectangle), and each rectangle is represented by
$r_i = (x_i, y_i, u_i, w_i)$.

Partitioning of one dimensional data ($d = 1$) has been
extensively studied in the past, and it has been shown that the
optimal solution can be obtained in polynomial time [17]. However,
for higher dimensions, even for the simple case $d = 2$, the
problem becomes intractable. Previously, a simpler version
of the problem, known as rectangle tiling, was studied. The
main objective of rectangle tiling is to partition a matrix of
integers into tiles, and it was proven to be NP-hard [14, 19]
for $d \ge 2$.

4.2 Methods and Details 

Fixed Grid Partitioning (FG). Fixed grid partitioning
is a simple space-oriented, non-overlapping partitioning approach
in which the spatial universe is partitioned into $k$
equal sized grid cells. A major assumption behind this approach
is that data follows a uniform distribution. Therefore, if the
data distribution is close to uniform, fixed grid partitioning
is expected to generate a reasonably good partitioning.
While the partition process is very simple, for the sake of
clarity, details of this approach are described in Algorithm 2.

Figure 2: Spatial partitions generated by different algorithms. The bigger rectangles in colors represent
partition boundaries, and the small rectangles represent the spatial objects.


Algorithm 2: Fixed grid partition (FG)

Input: a set of spatial objects R
Input: partition payload b

m = ⌈√(|R|/b)⌉;
U = spatialUniverse(R);
G = split U into m by m grid;
for r_i in R do
    g = grid cells that intersect with r_i;
    assign r_i to each grid cell in g;
end
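As an illustration of the fixed grid procedure, the sketch below assumes MBRs are given as (xmin, ymin, xmax, ymax) tuples and takes the spatial universe to be the bounding box of all input objects. The function name `fixed_grid_partition` and the clamping of the last cell are choices of this sketch, not details taken from the original implementation.

```python
import math
from collections import defaultdict

def fixed_grid_partition(rects, b):
    """Assign rectangles (xmin, ymin, xmax, ymax) to an m x m grid; returns (m, tiles)."""
    m = math.ceil(math.sqrt(len(rects) / b))
    xmin = min(r[0] for r in rects)
    ymin = min(r[1] for r in rects)
    xmax = max(r[2] for r in rects)
    ymax = max(r[3] for r in rects)
    # Cell size; fall back to 1.0 if the universe is degenerate in one dimension.
    w = (xmax - xmin) / m or 1.0
    h = (ymax - ymin) / m or 1.0

    def cell(v, origin, size):
        # Clamp so that objects on the upper edge of the universe land in the last cell.
        return min(int((v - origin) / size), m - 1)

    tiles = defaultdict(list)
    for idx, r in enumerate(rects):
        # A boundary object is replicated to every grid cell its MBR touches.
        for cx in range(cell(r[0], xmin, w), cell(r[2], xmin, w) + 1):
            for cy in range(cell(r[1], ymin, h), cell(r[3], ymin, h) + 1):
                tiles[(cx, cy)].append(idx)
    return m, dict(tiles)

rects = [(0, 0, 1, 1), (3, 3, 4, 4), (1, 1, 3, 3), (0, 3, 1, 4)]
m, tiles = fixed_grid_partition(rects, b=1)
```

With four objects and payload 1 the grid is 2 x 2; the object in the middle (index 2) straddles every cell and is therefore stored four times, so seven assignments are made for four objects.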


Binary Split Partitioning (BSP). Binary split partitioning
is a top-down approach that generates partitions by
recursively dividing a given spatial partition into two
non-overlapping subpartitions until the payload requirement is
met. Given an expected partition payload $b$, BSP recursively
creates subpartitions if the number of objects inside a partition
exceeds the specified payload (Algorithm 3). The split
point is chosen to be the median of the object centroids in that
partition. The direction of the split (horizontal or vertical)
depends on the relative ratio of the areas of the subpartitions:
the split direction is chosen so that the relative area
difference between the child nodes is minimized, based on a
probabilistic expectation.
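A compact recursive sketch of this idea is given below. It assumes MBRs as (xmin, ymin, xmax, ymax) tuples, selects the split direction by maximizing the product of the two child areas (which favors children of similar area), and replicates an object to every child it intersects. The name `bsp` and the stopping rule for non-separable objects are assumptions of this sketch.

```python
from statistics import median

def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def area(t):
    return (t[2] - t[0]) * (t[3] - t[1])

def bsp(rects, region, b):
    """Return a list of (region, objects) leaves; split while a region holds more than b."""
    if len(rects) <= b:
        return [(region, rects)]
    # Candidate cuts: median of the object centroids in each dimension.
    cx = median((r[0] + r[2]) / 2 for r in rects)
    cy = median((r[1] + r[3]) / 2 for r in rects)
    x_children = ((region[0], region[1], cx, region[3]), (cx, region[1], region[2], region[3]))
    y_children = ((region[0], region[1], region[2], cy), (region[0], cy, region[2], region[3]))
    # Prefer the direction whose children have the larger product of areas.
    children = max((x_children, y_children), key=lambda c: area(c[0]) * area(c[1]))
    leaves = []
    for child in children:
        inside = [r for r in rects if intersects(r, child)]
        if len(inside) == len(rects):
            # The cut did not separate anything; stop here to guarantee termination.
            leaves.append((child, inside))
        else:
            leaves.extend(bsp(inside, child, b))
    return leaves

corners = [(0, 0, 1, 1), (3, 0, 4, 1), (0, 3, 1, 4), (3, 3, 4, 4)]
leaves = bsp(corners, region=(0, 0, 4, 4), b=1)
```

For the four corner objects the first cut is vertical at x = 2 and each half is then cut horizontally at y = 2, giving four leaves with one object each.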

Strip Partitioning (SLC). Strip partitioning is a
non-overlapping, data oriented partitioning approach that has
some resemblance to slicing a cake. In this approach, rather
than defining a fixed space, we slice off a rectangular region
from the spatial universe such that the region contains
approximately $b$ objects. The same process is then applied to the
rest of the data and repeated until all the partitions have been
generated. Details of this approach are described in Algorithm 4.

Boundary Optimized Strip Partitioning (BOS). The
algorithms described above do not explicitly consider the boundary
object problem, although the partition payload is guaranteed
to be balanced. As a result, we may still get a partitioning
that is balanced but has a high deduplication
cost. BOS is a boundary object aware extension of SLC
that minimizes the number of boundary objects while still
generating a balanced partitioning. While performing the

Algorithm 3: Binary split partition (BSP)

Input: a set of spatial objects R
Input: partition payload b

U = spatialUniverse(R);
root = node(U);
for r in R do
    addObject(root, r);
end

function addObject(n, r):
    if n is a leaf node then
        n.objectList.add(r);
    end
    if size(n.objectList) > b then
        compute median-x and median-y splits;
        split = argmax(product of children areas);
        child1, child2 = children(n, split);
        if child1 intersects with r then
            addObject(child1, r);
        end
        if child2 intersects with r then
            addObject(child2, r);
        end
    end


strip based partitioning, BOS has two dimensions ($d$ dimensions
in general) to choose from at each step. BOS calculates the
partitioning in both dimensions, and selects the one which
induces the smaller number of boundary objects. Algorithm 5
describes the details of this approach.
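The dimension choice can be sketched as follows. This is a simplification under stated assumptions: strips are cut on sorted MBR centers with about $b$ centers per strip, a boundary object is one whose extent strictly crosses a cut, and the dimension is chosen once for the whole dataset rather than at every step. The names `strip_cuts`, `boundary_objects` and `choose_dimension` are hypothetical.

```python
def strip_cuts(rects, b, dim):
    """Cut positions along dim (0 = x, 1 = y) so that each strip holds about b centers."""
    centers = sorted((r[dim] + r[dim + 2]) / 2 for r in rects)
    # Place each cut halfway between the last center of one strip and the first of the next.
    return [(centers[i - 1] + centers[i]) / 2 for i in range(b, len(centers), b)]

def boundary_objects(rects, cuts, dim):
    """Number of objects whose extent strictly crosses at least one cut."""
    return sum(any(r[dim] < c < r[dim + 2] for c in cuts) for r in rects)

def choose_dimension(rects, b):
    """Return (dimension, [cost_x, cost_y]); x is chosen only if it is strictly cheaper."""
    costs = [boundary_objects(rects, strip_cuts(rects, b, d), d) for d in (0, 1)]
    return (0 if costs[0] < costs[1] else 1), costs

# Four wide, flat objects stacked vertically.
flat = [(0, 0, 10, 1), (0, 2, 10, 3), (0, 4, 10, 5), (0, 6, 10, 7)]
dim, costs = choose_dimension(flat, b=2)
```

Cutting these objects along x would slice through all four of them, whereas a cut along y at 3.5 passes between the second and third object and creates no boundary objects, so the y dimension is selected.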

Hilbert Curve Partitioning (HC). Space filling curves
are used in many applications to obtain a locality preserving
approximate total ordering for multidimensional data. Commonly
used space filling curves include the Z-curve, the Gray-coded
curve, and the Hilbert curve. Among those, the Hilbert
curve has been shown [24] to have better clustering properties for two
dimensional objects. In our implementation, we use the Hilbert
curve to map the centroid of each spatial object to a
curve value, and sort the dataset based on the curve
value. Then, we group each run of $b$ consecutive objects together
to form a spatial partition, and the union of their extents is
the final partition boundary.
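A minimal sketch of this procedure is shown below. It uses the standard iterative conversion from grid coordinates to a Hilbert curve index on an n x n grid (n a power of two); the grid resolution parameter `order` and the function names are assumptions of this sketch and are not taken from the original implementation.

```python
def hilbert_index(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) in an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_partition(rects, b, order=8):
    """Sort MBRs (xmin, ymin, xmax, ymax) by the Hilbert value of their centroid, pack b per partition."""
    n = 1 << order
    xmin, ymin = min(r[0] for r in rects), min(r[1] for r in rects)
    xmax, ymax = max(r[2] for r in rects), max(r[3] for r in rects)

    def to_cell(v, lo, hi):
        return 0 if hi == lo else min(int((v - lo) / (hi - lo) * n), n - 1)

    def key(r):
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        return hilbert_index(n, to_cell(cx, xmin, xmax), to_cell(cy, ymin, ymax))

    ordered = sorted(rects, key=key)
    parts = []
    for i in range(0, len(ordered), b):
        group = ordered[i:i + b]
        # The partition boundary is the union (MBR) of the extents of its objects.
        mbr = (min(r[0] for r in group), min(r[1] for r in group),
               max(r[2] for r in group), max(r[3] for r in group))
        parts.append((mbr, group))
    return parts

objs = [(3, 3, 4, 4), (0, 0, 1, 1), (3, 0, 4, 1), (0, 3, 1, 4)]
parts = hilbert_partition(objs, b=2, order=1)
```

Because each partition boundary is the union of its members' extents, nothing prevents two boundaries from overlapping in general, which is why HC is listed under the overlapping class.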

Sort-Tile-Recursive (STR) Partition. Packing spatial
objects for bulk loading a spatial index can be regarded as
a "mini-partition" step. Most often the leaf nodes are
pre-packed in order to generate the low level nodes of the index, and
higher level index nodes are constructed from the leaf nodes.
Similarly, we can use packing algorithms to generate spatial
partitions such that we only generate the lowest level index
Algorithm 4: Strip partition (SLC)

Input: a set of spatial objects R
Input: partition payload b
Input: partition dimension d

/* sort objects by MBR center in dimension d */
sort(R, d);
U = spatialUniverse(R);
while R is not empty do
    s = cutStrip(U, R, b);
    for r_i in R do
        if not r_i intersects with s then
            break;
        end
        assign r_i to partition s;
        if s contains r_i then
            remove r_i from R;
        end
    end
end


Algorithm 5: Boundary optimized strip partition (BOS)

Input: a set of spatial objects R
Input: partition payload b

U = spatialUniverse(R);
while R is not empty do
    // cost in dimension x
    cx = getCost(U, R, b, x);
    // cost in dimension y
    cy = getCost(U, R, b, y);
    if cx < cy then
        strip partition in x dimension;
    else
        strip partition in y dimension;
    end
end


nodes, and the node boundary serves as the partition boundary.
STR [21] first partitions the spatial universe into large
vertical strips, then each strip is further partitioned in the
horizontal direction. Algorithm 6 illustrates the partition
process.
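The two-pass structure can be sketched in a few lines. The sketch assumes MBRs as (xmin, ymin, xmax, ymax) tuples, sorts on MBR centers, and gives each vertical strip m * b objects as in the usual description of STR packing; the function name `str_partition` is hypothetical.

```python
import math

def str_partition(rects, b):
    """Pack MBRs into partitions of at most b objects using two sorting passes."""
    m = math.ceil(math.sqrt(len(rects) / b))     # number of vertical strips
    per_strip = m * b                            # objects per vertical strip
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
    parts = []
    for i in range(0, len(by_x), per_strip):
        # Within one vertical strip, order by y and cut into runs of b objects.
        strip = sorted(by_x[i:i + per_strip], key=lambda r: (r[1] + r[3]) / 2)
        for j in range(0, len(strip), b):
            group = strip[j:j + b]
            mbr = (min(r[0] for r in group), min(r[1] for r in group),
                   max(r[2] for r in group), max(r[3] for r in group))
            parts.append((mbr, group))
    return parts

# Eight unit squares on a 4 x 2 lattice.
lattice = [(x, y, x + 1, y + 1) for x in (0, 2, 4, 6) for y in (0, 2)]
str_parts = str_partition(lattice, b=2)
```

With eight objects and payload 2 this yields two vertical strips of four objects, each cut into two partitions of two objects.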

Figure 2 shows a simple example in which a set of 32
randomly distributed spatial objects is partitioned with
the different spatial partition algorithms described above.


5. SPATIAL PARTITION EFFICIENCY 

Optimal spatial partitioning is NP-hard. For spatial query
processing tasks, the performance of a query on an optimal
partition layout may not be very different from that on a
suboptimal partitioning. Therefore, finding a reasonably good
partitioning in an efficient manner has practical implications
for many real world applications. In a spatial data warehousing
scenario, the underlying dataset is large and relatively
stable, and queries run on the same dataset many times.
In such a case, an approach that produces a balanced
partitioning but requires significant computational resources may
be acceptable, as it improves the query performance in the
long run. However, in some other application scenarios such


Algorithm 6: Sort-Tile-Recursive partition (STR)

Input: a set of spatial objects R
Input: partition payload b

m = ⌈√(|R|/b)⌉;
// m strips in dimension x
S = stripPartition(R, x);
for i = 1 to m do
    // m strips in dimension y
    t = stripPartition(S[i], y);
end

as scientific data exploration and simulation, queries consume
large amounts of intermediate data that are generated
quickly, and most queries run only once as the data is being
generated. In such cases, a fast partitioning algorithm is
critical for achieving overall fast query response. In this paper
we explore two different approaches towards improving
spatial data partitioning efficiency, namely parallel spatial
partitioning and partitioning with sampling.

5.1 Parallel Partitioning with MapReduce 

Our parallelization approach is based on the following two
insights. First, spatial partition algorithms involve some
kind of sorting based on a derived spatial value, and MapReduce
can perform such a task very efficiently. We can tweak
the shuffle-and-sort phase of MapReduce to perform this
task for (almost) free. Second, as different regions of a spatial
dataset can be partitioned independently, rather than
changing the algorithms for parallelization, we can run the
partition algorithms on different regions of the dataset in
parallel. Although the generated partition layout may be
different from the one generated by a single-threaded
partitioning program, it is acceptable as long as the partitioning
is reasonably good.


Algorithm 7: MapReduce based spatial partition

Input: a set of spatial objects R
Input: partition payload b

S = sample_for_partitioning(R);

function Map(k, v):
    anchor = getAnchor(v);
    key = calculateKey(anchor, S);
    emit(key, v);

/* shuffle and sort by MapReduce */

function Reduce(k, v):
    /* partition the bucket with algorithm X */
    P = genPartitionX(v);
    emit(P);


We propose the following approach for MapReduce-based
parallelization of spatial partition algorithms. First, similar
to Hadoop Terasort [28], we sample the dataset to generate an
anchor point list, which is used by the partition function
of MapReduce for partition assignment. In the Map phase we
calculate a spatial ordering anchor for each object, such as
its geometric center or Hilbert curve value, and generate a
key based on the sample points generated previously. Next, the
MapReduce framework partitions the objects into groups
based on their anchor location and sorts them on the anchor
value. At this point, the dataset is roughly partitioned into
large spatial partitions. Later, in Section 6, we discuss
issues related to this coarse level partitioning. In the
Reduce phase, each reducer works on a single large partition
and further divides it into smaller partitions. Algorithm 7
gives a sketch of this approach.
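
The flow of Algorithm 7 can be imitated in a single process to make the data movement concrete. The sketch below is only an illustration under stated assumptions: objects are MBR tuples (xmin, ymin, xmax, ymax), the ordering anchor is the x-coordinate of the MBR center (a stand-in for a Hilbert curve value), and the reducer uses fixed-size chunking as a placeholder for "algorithm X". All function names are hypothetical and not taken from the released code.

```python
import bisect
import random
from collections import defaultdict


def get_anchor(mbr):
    """Ordering anchor of an object: here the x-coordinate of its MBR center."""
    xmin, _, xmax, _ = mbr
    return (xmin + xmax) / 2.0


def sample_split_points(objects, num_buckets, sample_size, seed=0):
    """Sample anchors and pick num_buckets - 1 split points (Terasort style)."""
    rng = random.Random(seed)
    picked = rng.sample(objects, min(sample_size, len(objects)))
    anchors = sorted(get_anchor(o) for o in picked)
    step = len(anchors) / num_buckets
    return [anchors[int(i * step)] for i in range(1, num_buckets)]


def map_phase(objects, splits):
    """Map: emit (bucket key, object); the key is the split interval of the anchor."""
    for obj in objects:
        yield bisect.bisect_right(splits, get_anchor(obj)), obj


def shuffle_and_sort(pairs):
    """Stand-in for the framework shuffle: group by key, sort each group by anchor."""
    buckets = defaultdict(list)
    for key, obj in pairs:
        buckets[key].append(obj)
    return {k: sorted(v, key=get_anchor) for k, v in sorted(buckets.items())}


def reduce_phase(bucket, payload):
    """Reduce: split one coarse bucket into partitions of at most `payload` objects."""
    return [bucket[i:i + payload] for i in range(0, len(bucket), payload)]


def parallel_partition(objects, num_buckets, payload, sample_size=100):
    """Run the sample, map, shuffle and reduce steps in sequence."""
    splits = sample_split_points(objects, num_buckets, sample_size)
    buckets = shuffle_and_sort(map_phase(objects, splits))
    return [p for b in buckets.values() for p in reduce_phase(b, payload)]
```

In a real deployment each bucket would be handled by a separate reducer; here the buckets are processed one after another, which is enough to show that the coarse buckets are independent of each other.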

5.2 Partitioning on Sampled Data 

The efficiency of a partition algorithm is subject to a number
of factors, such as algorithm runtime complexity and dataset
characteristics and size. In relational database systems,
sampling is used in various tasks to avoid full dataset
processing. For example, typical histogram construction
algorithms work on a small fraction of sampled data, thus
avoiding the expensive computation of full dataset statistics.
Such an approach has been shown to be practical and efficient
for query processing and dataset approximation. Therefore it
is natural to ask whether we can generate a spatial partition
schema on a sampled dataset that reasonably approximates a
full dataset partitioning.

Specifically, given a sampling ratio γ, we uniformly sample
the dataset to get a reduced dataset of size γ|R|, and run
a partition algorithm on this reduced dataset. Then, we
map the generated partition layout onto the original dataset
for final partition assignment and boundary object
replication. The sampling ratio is the main control variable
in sampling based approaches. If the sampling ratio is too
low, the resulting partition quality may suffer. On the other
hand, if the sampling ratio is unnecessarily high, the
partition efficiency may suffer while the partition quality is
only marginally improved. We explore these issues in
Section 6.8.
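
The two steps described above (derive a layout from a sample, then map it onto the full dataset) can be sketched as follows. This is a minimal illustration, not the evaluated implementation: it assumes MBR tuples (xmin, ymin, xmax, ymax), uses simple vertical strips as the partition layout, and scales the expected payload by the sampling ratio when cutting the sample. The names `layout_from_sample` and `assign_to_layout` are hypothetical.

```python
import bisect
import random


def layout_from_sample(objects, ratio, payload, seed=0):
    """Derive vertical strip boundaries from a uniform sample of the dataset.

    `payload` is the expected number of objects per partition in the full
    dataset; `ratio` is the sampling ratio.
    """
    rng = random.Random(seed)
    sample = rng.sample(objects, max(1, round(ratio * len(objects))))
    centers = sorted((o[0] + o[2]) / 2.0 for o in sample)
    # A partition of `payload` objects holds about ratio * payload sample points.
    step = max(1, round(ratio * payload))
    return [centers[i] for i in range(step, len(centers), step)]


def assign_to_layout(objects, boundaries):
    """Map the layout onto the full dataset, replicating boundary objects."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for obj in objects:
        first = bisect.bisect_right(boundaries, obj[0])  # strip of xmin
        last = bisect.bisect_right(boundaries, obj[2])   # strip of xmax
        for idx in range(first, last + 1):
            partitions[idx].append(obj)
    return partitions
```

Because strips extend to infinity on both ends, this layout always covers the whole space, which is the property that tight-MBR layouts lack when they are derived from a sample.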

One problem with sampling based partitioning is that
some approaches fail to generate an effective spatial
partitioning on a sampled dataset. For example, HC and STR
generate partition regions that may not cover the entire
spatial universe (Fig. 2(b) and 2(e)), and the partition
region MBRs are tight. In such cases, the partitions derived
from the sampled dataset cannot be used without further
adjustment. How to adapt those approaches for spatial
partitioning on a sampled dataset is a problem we plan to
explore in future work.

6. EXPERIMENTAL EVALUATION 

6.1 Experimental Setup 

We use Amazon EMR for our benchmarking tasks. For
single thread benchmarking of partition algorithms, we use
a large memory physical machine with 128 GB of memory.
For spatial join query scalability tests, we use general
purpose extra-large instances as our core and task nodes.
Each extra-large instance is equipped with 15 GB memory,
4 virtual cores and 4 disks with 1680 GB storage (4 x 420
GB). The Amazon Machine Image (AMI) version we used
for the cluster nodes is 3.0.2. Amazon S3 is used as the
primary data storage for data serving.

6.2 Datasets and Queries 

OpenStreetMap (OSM) Dataset. OpenStreetMap [1] is
a collaborative mapping project that aims to create a free
editable map of the world. It contains spatial representations
of geometric features such as lakes, forests, buildings and
roads. Spatial objects are represented by a specific data
type such as points, lines, and polygons. We downloaded the
dataset from the official website, and parsed it into WKT
format for Hadoop-GIS to process. The table schema is
fairly simple, and it has roughly 70 columns. We use the
polygonal representation table, which contains more than 87
million spatial objects.

Imaging (PI) Dataset. This dataset comes from an in-silico
analysis of pathology images, in which the boundaries of
micro-anatomic objects such as nuclei and tumor regions are
segmented and represented as polygons. Boundaries of spatial
objects are validated, normalized, and represented in WKT
format. We have a set of 18 images (44GB) from a brain
tumor study at Emory University Hospital. Each of these
images contains 0.5 million spatial objects on average.

Benchmark Query. We use a spatial join query to empirically
evaluate the impact of a specific spatial partitioning
on query performance. The spatial join is a very expensive
query type that is used in many benchmarking tests
to evaluate system performance. Processing a spatial join
query requires evaluating the spatial predicate against all
possible record pairs, and is therefore very time consuming.

The spatial join predicate we used in the join query test is
st_intersects, and the SQL expression of this query is
given in [6].
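
To illustrate why evaluating the predicate against all record pairs is costly, the following sketch runs a nested-loop join over two lists of objects. It is a simplified illustration only: the rectangle overlap test on MBR tuples (xmin, ymin, xmax, ymax) stands in for the exact polygon test behind st_intersects, and the function names are hypothetical.

```python
from itertools import product


def mbr_intersects(a, b):
    """True if two MBRs (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def naive_spatial_join(left, right):
    """Evaluate the predicate on every pair: |left| * |right| tests."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if mbr_intersects(a, b)
    ]
```

With a spatial partitioning in place, the same routine would be applied per partition, so the quadratic cost is paid only on objects that share a partition.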

6.3 Parameters and Metrics 

Partition Payload. The two datasets, OSM and PI, are
from different application domains and have different
characteristics. Therefore, using the same parameter to
partition both datasets may be problematic. For example, an
expected payload of c that is ideal for the smaller dataset
and yields its best query performance might produce a
partitioning that is too fine-grained for the larger dataset.
To make the results comparable, we define the partition
payload relative to the dataset size. We use a wide range of
fractions that are multiplied by the dataset size to obtain
the actual partition payload. Table 2 shows those numbers.

| f | 0.001 | 0.005 | 0.01 | 0.02 | 0.05 | 0.1 | 0.2 | 0.5 | 1.0 | 5.0 |

Table 2: Partition Parameter: Fraction (×10^-2)
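
As a worked example of how a table entry could translate into an actual payload, the sketch below multiplies the fraction (read as being in units of 10^-2, as the caption suggests) by the dataset size. The function name, the rounding, and the lower bound of one object are assumptions made for illustration.

```python
FRACTIONS = [0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0]


def partition_payload(dataset_size, fraction):
    """Expected objects per partition for a fraction given in units of 10^-2."""
    return max(1, round(dataset_size * fraction * 1e-2))


# Payloads for a hypothetical dataset of one million objects.
payloads = [partition_payload(1_000_000, f) for f in FRACTIONS]
```

Under this reading, the smallest fraction gives 10 objects per partition and the largest gives 50,000 objects per partition for the hypothetical one-million-object dataset.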

Boundary Object Ratio. We define a simple metric to
study the relationship between partition granularity and
partition quality in terms of boundary objects. For a dataset R
that is partitioned into k partitions P = {p_1, p_2, ..., p_i, ..., p_k},
we define the boundary object ratio as:

$$\lambda = \frac{\sum_{i=1}^{k} |p_i| - |R|}{|R|}$$

λ is a real value that lies in the interval [0, ∞). If a spatial
partitioning does not induce any boundary objects, the value
of λ would be 0.
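
A small sketch of how this metric could be computed is given below. It assumes that the boundary object ratio measures the relative growth in stored objects caused by replicating boundary objects, which is consistent with a value of 0 when no object is replicated; the function name is hypothetical.

```python
def boundary_object_ratio(dataset_size, partitions):
    """Relative growth in object count caused by boundary object replication."""
    stored = sum(len(p) for p in partitions)
    return (stored - dataset_size) / dataset_size


# Four objects, one of which ("c") crosses the partition boundary.
ratio = boundary_object_ratio(4, [["a", "b", "c"], ["c", "d"]])
```

Here `ratio` is 0.25: one extra copy for a dataset of four objects.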

6.4 Comparison of Partition Quality 

Before we evaluate the partition results with real queries, 
we present some statistical properties of the generated par¬ 
titions which can provide us insights on the partition algo¬ 
rithm behavior and quality. 

6.4.1 Partition Balance 

Fig. 3 shows the standard deviation of the generated partition
payloads for different partition algorithms on the two
datasets. Here, we use standard deviation as a measure of
partition skewness. The horizontal axis represents the
expected partition payload, a granularity value that we use to
partition the datasets. The vertical axis represents the
standard deviation of the generated partition payloads. Two
conclusions can be drawn from the figure. First, as the
partition granularity increases, the skew tends to increase
very quickly for all methods. Therefore, a very coarse level
spatial partitioning should be avoided for parallel processing
tasks that suffer from data skew. Second, not surprisingly,
FG generates significantly more skewed partitions than the
other approaches.

Figure 3: Standard deviation of partition results

If we compare the same approach across the two datasets with
the same parameter setting, we find that the partitions
generated from the OSM dataset are more skewed than those
generated from the PI dataset. That means that, overall, the
inherent skew in OSM is much more severe than in the PI
dataset. Furthermore, the FG partitioning of the PI dataset
is considerably better than the FG partitioning of the OSM
dataset. Therefore, we can conclude that, for an evenly
distributed dataset, the FG approach can generate a reasonably
good partitioning. However, if the dataset is highly skewed,
the FG approach may generate a very low quality partitioning.
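
The skew measure used in this comparison can be sketched in a few lines. The example assumes the population standard deviation of the partition sizes; whether the reported numbers use the population or the sample form is not stated, so this choice is an assumption, as is the function name.

```python
from statistics import pstdev


def partition_skew(partitions):
    """Standard deviation of partition payloads (population form)."""
    return pstdev([len(p) for p in partitions])


balanced = partition_skew([[0] * 10, [0] * 10, [0] * 10])
skewed = partition_skew([[0] * 2, [0] * 4])
```

A perfectly balanced layout gives 0, and the value grows as partition sizes diverge.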

Adaptive approaches, such as STR, BOS and SLC, should
be able to handle a certain level of data skew as they can
make smarter data oriented partition decisions. We can see
from the figures that the corresponding lines for those
approaches are relatively flat until the partition granularity
gets large. However, as the partitions get larger, the
adaptability of those algorithms also reaches its limits.

One result we did not expect is that the partitions
generated by the HC approach are as skewed as the
FG partitions, and for the PI dataset HC is not even as
good as FG. As HC is a data oriented approach
that is traditionally used for bulk loading spatial indexes,
it is surprising that the partitioning from HC has such high
imbalance.

6.4.2 Boundary Objects 

Fig. 4 shows the ratio of boundary objects generated by
different algorithms. We can see the overall trend that, for
both datasets, as the partition granularity increases the
ratio of boundary objects decreases. FG seems to be a good
algorithm if our main objective is to have fewer boundary
objects. However, as both figures show, a very fine granular
partitioning is problematic as it significantly increases the
dataset size, and in certain cases such an increase can be
dramatic. For example, if we look at the first horizontal
axis data point in Fig. 4 (a), for Strip partitioning (SLC)
the boundary object ratio is 1.86, whereas the same data
point value is 16.1 in Fig. 4 (b). Such a large increase in
data size is certainly not acceptable, and we can conclude
that a very fine granular partitioning is not a practical
approach for large scale query processing.

Figure 4: Ratio of boundary objects

Interestingly, in both figures, the lines for the slicing
approaches, SLC and BOS, have steeper slopes than those of
the other approaches. This indicates that, for those
partitioning algorithms, even a slight increase in the
partition payload can lead to significantly fewer boundary
objects. Therefore, in practice, those partition methods
should be configured to generate relatively large partitions
so that the number of boundary objects is reasonably small.

6.5 Effects of Partitioning on Query Performance

In this section, we empirically evaluate the partition
algorithms under different configurations to study how a
specific partitioning affects query performance, and
investigate the relationship between partition granularity and
query performance. The experiments are performed on a 50 node
Amazon AWS MapReduce cluster, and general purpose AWS
instances are used as compute nodes and storage nodes.
Each experiment is conducted three times, and the average of
those three runs is reported to account for performance
variations in the cloud environment.

Figure 5: Spatial join query performance


Fig. 5 shows the performance of the spatial join query
on the two datasets. The horizontal axis represents the
partition granularity, and the vertical axis represents the
query performance. Clearly, neither a very fine nor a very
coarse partitioning yields the optimal query performance. For
a fine granular partitioning, the main cause is the high
boundary object ratio, which increases not only the I/O
overhead but also the computation overhead. For a very coarse
granular partitioning, however, the root cause is the data
skew between partitions.

Recall that, in Section 2, our cost analysis framework
suggests that there is a point of optimal partition
granularity that yields the best query performance. The
performance numbers on both datasets support this. As the
figures show, overall, query performance is close to optimal
in the mid-range of the horizontal axis, and performance
starts to degrade as the partition granularity increases.
However, if we compare different algorithms over a wide range
of partition granularities, it is difficult to generalize such
a statement. Specifically, BSP and STR have relatively better
performance over a wider range of partition granularities, and
their performance starts to suffer only after the partition
granularity becomes too large. This can be attributed to the
ability of these algorithms to adaptively handle data skew and
boundary objects.

Performance variance between datasets. In Fig. 5 (a),
the performance of the different approaches is tiered. FG and
HC have similar performance, and their performance is almost
orders of magnitude worse than that of the other approaches
(due to the long query runtime, we only report one data point
for FG). While the performance of HC is still the worst on
the PI dataset, as shown in Fig. 5 (b), the performance of FG
is almost optimal in most cases. Clearly, specific
characteristics of a dataset contribute to this difference.
Our observation indicates that the PI dataset consists of a
large number of small objects that are fairly evenly
distributed across space, whereas the OSM dataset consists of
a variety of objects of all sizes that are clustered around a
number of hotspots. If we consult the statistical properties
from Section 6.4, we can also see that the FG partitioning
of the PI dataset is less skewed than that of the OSM dataset.
Moreover, the number of boundary objects from FG partitioning
is very small at all partition granularities. For those
reasons, on the PI dataset, FG partitioning achieves a
balanced partitioning for “free”, and has an unfair advantage
over other approaches.

6.6 Partition Efficiency 

In this subsection we study the partition efficiency of
different algorithms. To perform a fair comparison, the time
for reading the dataset from disk and writing the partition
results to disk is not included in the performance
measurement. The measured time only includes the time
for deriving the actual partition boundaries after the dataset
is loaded into the main memory of a single machine.

Figure 6: Spatial partition performance

Fig. 6 shows the runtime cost of the partition algorithms on
the two datasets. Depending on the actual runtime performance,
the algorithms can be roughly grouped into three categories:
fast (FG, BSP), medium (HC, STR), and slow (SLC, BOS).
For both datasets, FG partitioning has the lowest runtime
cost, which is only in the range of milliseconds, and BSP has
the second best performance. However, the other four
algorithms require considerable amounts of time to generate
partitions. Specifically, the space slicing approaches, SLC
and BOS, require more than an hour to derive a partitioning
on OSM. This is mainly because SLC and BOS not only sort the
dataset on one dimension, but also perform many boundary
object examinations. The main cost of HC is the Hilbert curve
calculation and the sorting based on the curve value. The
performance of the algorithms on the different datasets is
roughly similar, with the exception of HC, which is slightly
slower on the PI dataset than on OSM.

Figure 7: Spatial partition performance variance

Figure 7 shows the runtime performance of the algorithms
over different partition granularities. While the performance
of the algorithms does not depend much on the partition
granularity, there are noticeable differences. Intuitively, a
finer granularity partitioning entails more CPU cycles, and
therefore it is expected that algorithms run slower for small
payload values. The performance numbers of FG and BSP show
such a tendency. However, depending on the algorithm and
dataset characteristics, this hypothesis may not hold.
For example, the main cost in HC partitioning comes from
calculating the Hilbert curve values and sorting the spatial
objects on them. Regardless of partition granularity, this
cost is constant. Therefore, as the figure shows, the
performance of HC does not change with partition granularity.
Interestingly, STR has slightly degraded performance at larger
partition granularities on the OSM dataset. The specific
reasons are not completely clear to us, and we plan to
investigate this problem in future work.
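
The granularity-independent part of the HC cost can be seen in a small sketch of the two steps it names: computing a Hilbert curve position per object and sorting on it. The code below uses the widely known iterative grid-to-curve conversion for an n x n grid (n a power of two); mapping real coordinates to grid cells is left out, and the function names are hypothetical rather than taken from the evaluated implementation.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) on the Hilbert curve of an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_sort(cells, n):
    """Sort grid cells by curve position; this step dominates the HC cost."""
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
```

Once the objects are sorted, cutting the sorted sequence into partitions of any payload is a single linear pass, which matches the observation that the HC runtime barely depends on the partition granularity.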

If we compare the relative performance of the algorithms
across the two datasets, the lines for the PI dataset are
smoother and more predictable. For example, on the OSM
dataset, SLC and BOS have irregular runtime performance over
different partition payloads. However, those algorithms do not
exhibit the same behavior on the PI dataset. Given the dataset
characteristics we discussed earlier, we can conclude that
dataset characteristics affect algorithm performance.

6.7 Parallel Partitioning with MapReduce 

Spatial partitioning is a time consuming process and, as
the performance numbers in the previous subsection show, it
may take hours. This has motivated us to develop MapReduce
based spatial partitioning to improve query performance and
the spatial ETL process. To test the efficiency and
scalability of our MapReduce based parallel partitioning
approach, we modified and tested a selected set of four
partitioning algorithms, namely BSP, SLC, BOS and STR. The
rationale for this selection is that (1) parallelization of FG
and HC is straightforward, and (2) they generate suboptimal
partitioning in most cases. We select the three
expensive spatial partitioning approaches (SLC, BOS, STR)
for the experiments. While BSP is reasonably fast, we also
include it in our experiments to compare its performance with
the other approaches. These experiments are also performed on
Amazon EMR and, unlike the performance measurement in the
previous subsection, here the runtime performance includes
both I/O cost and computation cost.



Fig. 8 (a) shows a scalability chart for the three MapReduce
based parallel partitioning approaches on the OSM dataset.
The horizontal axis represents the number of nodes used for
parallelization, and the vertical axis represents the
partition runtime. The performance is measured with a top
level coarse partition granularity of 500000. While this
number may seem arbitrary, our experiments show that the
scalability is not affected by the coarse partitioning
granularity. As the figure shows, the MapReduce based
partitioning approach is very scalable and efficient. With
increased cluster capacity, the runtime performance improves
almost linearly. With parallelization, the partition
efficiency of the algorithms increases by an order of
magnitude. For example, the runtime of BOS decreases from 4000
seconds to merely 300 seconds. Although the algorithms have
very different runtime performance in a single-threaded
implementation, their performance after parallelization is
fairly homogeneous.

Recall that our parallelization algorithm performs
partitioning in two steps: a top level coarse partitioning for
parallel partitioning, and a bottom level partitioning in
which the coarse partitions are re-partitioned with specific
spatial partition algorithms. Each step involves a
partitioning granularity parameter which controls partition
size. To study the effects of those parameters on parallel
partitioning performance, we perform two separate experiments.
In the first experiment, we fix the coarse top level
partitioning granularity and test the runtime performance with
different bottom level partitioning granularities. Not
surprisingly, the performance differences between parameter
values are too small to be significant, and consequently we
can conclude that the bottom level partitioning granularity
has no noticeable effect on parallel partitioning performance.

In the second experiment, we fix the bottom level
partitioning and change the top level partition granularity.
Figure 8 (b) shows the performance variations of parallel
partitioning for different partition granularities. We can see
that as the top level partitioning granularity gets coarser,
the performance gets better. Our profiling of the parallel
algorithms provides the following explanation. Like
Terasort [28], the parallelization algorithms use a sampled
data file for assigning the spatial objects into separate
partition groups which have a global total ordering. In a
finer granularity top level spatial partitioning, the total
order based partition group assignment becomes the major
bottleneck. Interestingly, visualization of the partition
boundaries shows that spatial partition results from a larger
top level partitioning bear more resemblance to the partition
results from a single-threaded approach.

Figure 8: Parallel partitioning performance. (a) scalability; (b) performance

6.8 Spatial Partitioning with Sampling

Fig. 9 shows a statistical evaluation of three sampling
based partitioning approaches on the OSM dataset. The
figures in the left column show the standard deviation, our
measure of skewness, of the generated partitions, and the
figures in the right column show the boundary object ratio.
The full dataset is sampled with different sampling rates
(shown in the legend of the figures), and the resulting
partitions from the sampled dataset are compared against the
partitioning generated from the full dataset. A sampling rate
of 1.0 represents full dataset partitioning. From the figures
we can see that sampling can be a very effective approach for
spatial partitioning.

Figure 9: Quality of partitions generated by sampling based
approaches

Intuitively, the higher the sampling rate, the better we can
preserve the data distribution, and consequently the
partitioning on the sampled dataset is of higher quality. If
we look at the figures in the left column, we can see that
partitions generated with a higher sampling rate are less
skewed than those generated with a lower sampling rate.
However, the effect on partition skew depends on the
algorithm. For example, as BSP implicitly tries to avoid a
skewed partitioning, the impact of a higher sampling rate is
not significant, whereas in SLC and BOS a higher sampling rate
seems to be always beneficial, with a minor exception.
Specifically, in SLC and BOS, if the partition payload is
reasonably large, sampling based approaches can generate a
less skewed partitioning than the full dataset partitioning.
This is particularly interesting, and it has important
implications for certain application scenarios. First, by
using a sampling based approach we can significantly reduce
the partition time. Second, aside from the improved
performance, we can actually obtain a less skewed
partitioning, with the minor limitation of a large partition
size. Interestingly, the ratio of boundary objects generated
by sampling based partition approaches is not completely
dependent on the sampling ratio. Overall, the sampling based
partitioning approaches generate more boundary objects than
the full dataset partitioning, although the difference is not
significant.

7. RELATED WORK 

To the best of our knowledge, this is the first work that
studies the spatial data partition problem in detail. The data
partition problem has been discussed extensively in the
context of database systems over the last few decades [8, 33].
Fixed grid partitioning and its variations are used for
spatial join processing in [29, 39]. Le et al. [20] studied
the problem of finding optimal splitters for large interval
data. More recently, MapReduce based systems have emerged as
an effective solution to Spatial Big Data challenges [13].
HadoopGIS [6] is a spatial data warehousing system that is
based on a general spatial query processing framework. The
system uses SQL as the query language, and is integrated into
Hive [37]. SpatialHadoop [12] is an extension of Hadoop for
spatial query processing, and it extends Pig [11] at the query
language layer. Ray et al. [30] proposed a spatial data
analysis infrastructure that uses a combination of a cloud
environment and relational database systems. The authors also
briefly discussed a hybrid approach that uses the Hilbert
curve and space partitioning for spatial join processing.

Spatial histogram construction is extensively studied in 
database settings, and it is widely used for approximate 
query processing. The main goal of spatial histogram con¬ 
struction is to partition the multi-dimensional data into buck¬ 
ets (most often a bucket represents a rectangular region), 
where data within buckets is uniformly distributed. In that
sense, spatial histogram generation is related to spatial
partitioning, but not the same. In [26], the authors showed
that computing a non-overlapping rectangular partitioning
with near-uniform data distribution within buckets is
NP-hard. One of the pioneering works is [25], in which the
authors proposed to extend the concept of equi-depth
histograms to multidimensional data. An in-memory data
structure, hTree, is designed for storing the histograms. It
constructs a non-overlapping partitioning of multidimensional
space based on object frequencies. However, the location of
objects is not considered for histogram construction, which
may result in skewed histograms. The MinSkew histogram [3]
was proposed to remedy some of the disadvantages of hTree.
Specifically, the authors proposed two construction
strategies. The
first approach has two phases. In the first phase, the algo¬ 
rithm tiles the spatial universe into uniform regular grids 
and stores the number of intersecting spatial objects for 
each tile. Then based on the tiling, a recursive binary space 
partitioning (BSP) is used for histogram construction. Au¬ 
thors experimentally observed that a fixed-size tiling is sen¬ 
sitive to the size of the queries (high grid resolution favors 
small sized queries and vice versa), and proposed another ap¬ 
proach MinSkew-Progressive-Refinement which can utilize 
multi-resolution tiling. 

GenHist [15] is a recent approach which can identify high
density regions for real valued attributes. However, in
GenHist bucket rectangles may overlap, and buckets can
be contained in other buckets. It uses a fixed-size grid as
the basis of histogram construction. More recently, an
approach called STHist [31] was proposed to generate density
aware histograms. In the basic STHist algorithm, the decision
about whether a region is dense is made by applying a
sliding window over all dimensions and approximating the
frequency distribution by a marginal distribution. In the
advanced variant called STForest, the algorithm first
computes coarse partitions according to the object skew, and
then applies a sliding window algorithm to them. STHist
has a time complexity of O(n^2) for 2-dimensional and O(n^3)
for 3-dimensional data.

A convenient approach to obtain a spatial histogram is to
generate it using a spatial index structure such as the
R-Tree [16], R*-Tree [7], or R+-Tree [34]. RK-Hist [10] is an
example of such an approach, which is based on the R-tree
bulk-loading procedure. The data is presorted according to the
Hilbert space-filling curve. After the leaf nodes are
generated, a histogram can be generated by packing nodes
according to the sorting order into equi-sized histogram
buckets. However, this may not necessarily generate a good
partitioning. Specifically, for approximately uniformly
distributed data equi-sized partitioning wastes buckets for
regions with a high object density and produces high overlap
between buckets. Therefore, the authors proposed a greedy
algorithm utilizing a sliding window of pages along the
Hilbert order. The algorithm is parametrized with a number of
buckets that should be considered for a split. A bucket split
is applied if it leads to an improvement according to the
proposed cost function. More recently, a new approach,
R-V [2], was proposed to overcome the skewed data distribution
problem.

8. CONCLUSION 

A proper spatial partitioning schema is essential for
optimal query performance and system efficiency in scalable
distributed spatial query processing. In this paper, we
formally introduce the spatial partition problem, and present
a comprehensive study of six different partitioning
algorithms. We categorize the algorithms along three
dimensions, and provide a systematic evaluation of the
algorithms on two real world datasets from different domains.
We also propose parallelization methods to improve the
efficiency of the spatial data partition process, and explore
sampling based partitioning as an alternative for fast spatial
partitioning. Our study reveals several insights on how
partitioning affects query performance and what factors should
be considered for effective spatial partitioning. The results
provide practical guidelines for designing spatial
partitioning for large scale parallel spatial query
processing.